Days of the week / Months of the year
Use the built-in calendar module to provide lists of day and month names and abbreviations.
import calendar
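# names used by the expressions on this page
from pyparsing import (oneOf, Regex, Word, Combine, Literal, Optional,
    delimitedList, hexnums, nums, alphas)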
monthName = oneOf( list(calendar.month_name)[1:] )
monthAbbr = oneOf( list(calendar.month_abbr)[1:] )
dayName = oneOf( list(calendar.day_name) )
dayAbbr = oneOf( list(calendar.day_abbr) )
# parse action to convert month_abbr to value 1-12
mname2mon = dict((m,i) for i,m in enumerate(calendar.month_abbr) if m)
monthAbbr.setParseAction(lambda t: mname2mon[t[0]])
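The `[1:]` slices and the `if m` filter are needed because `calendar.month_name` and `calendar.month_abbr` reserve index 0 for an empty string, so that January sits at index 1. A minimal standard-library sketch of what the lists and the lookup table contain; the variable names are chosen here for illustration, and the names themselves depend on the active locale:

```python
import calendar

# Index 0 of the month arrays is an empty placeholder, so January is index 1.
month_names = list(calendar.month_name)[1:]
month_abbrs = list(calendar.month_abbr)[1:]

# Same construction as mname2mon: skip the empty entry, keep 1-based numbers.
abbr_to_number = dict((m, i) for i, m in enumerate(calendar.month_abbr) if m)

# The third abbreviation maps back to month number 3.
print(month_abbrs[2], abbr_to_number[month_abbrs[2]])
```

The day arrays have no placeholder entry, which is why `dayName` and `dayAbbr` use the lists unsliced.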
Chemical symbols of the elements
All of the elements, in a MatchFirst expression (oneOf will reorder the entries as necessary to make sure the "H" does not mask "He" or "Hg", for example).
element = oneOf( """H He Li Be B C N O F Ne Na Mg Al Si P S Cl
Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge
As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag
Cd In Sn Sb Te I Xe Cs Ba Lu Hf Ta W Re Os
Ir Pt Au Hg Tl Pb Bi Po At Rn Fr Ra Lr Rf
Db Sg Bh Hs Mt Ds Rg Uub Uut Uuq Uup Uuh Uus
Uuo La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm
Yb Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No""" )
or, as a single regular expression:
element = Regex("A[cglmrstu]|B[aehikr]?|C[adeflmorsu]?|D[bsy]|"
"E[rsu]|F[emr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airu]|"
"M[dgnot]|N[abdeiop]?|Os?|P[abdmortu]?|R[abefghnu]|"
"S[bcegimnr]?|T[abcehilm]|Uu[bhopqst]|U|V|W|Xe|Yb?|Z[nr]")
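Ordering matters in both forms: an alternation that tried "H" before "He" would stop at "H". A minimal sketch using only the standard `re` module (not pyparsing) to check that the regular expression accepts every symbol in the list in full and prefers the longer symbol where one exists. The names `SYMBOLS` and `ELEMENT_RE` are introduced here for illustration.

```python
import re

SYMBOLS = """H He Li Be B C N O F Ne Na Mg Al Si P S Cl
    Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge
    As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag
    Cd In Sn Sb Te I Xe Cs Ba Lu Hf Ta W Re Os
    Ir Pt Au Hg Tl Pb Bi Po At Rn Fr Ra Lr Rf
    Db Sg Bh Hs Mt Ds Rg Uub Uut Uuq Uup Uuh Uus
    Uuo La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm
    Yb Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No""".split()

ELEMENT_RE = re.compile(
    "A[cglmrstu]|B[aehikr]?|C[adeflmorsu]?|D[bsy]|"
    "E[rsu]|F[emr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airu]|"
    "M[dgnot]|N[abdeiop]?|Os?|P[abdmortu]?|R[abefghnu]|"
    "S[bcegimnr]?|T[abcehilm]|Uu[bhopqst]|U|V|W|Xe|Yb?|Z[nr]")

# Every symbol in the list must be matched completely by the regex.
all_match = all(ELEMENT_RE.fullmatch(s) for s in SYMBOLS)

# The optional second letter is greedy, so "Hg" is not cut short to "H".
longest = ELEMENT_RE.match("Hg").group()

# Splitting a simple formula into its element symbols.
tokens = ELEMENT_RE.findall("NaOH")
```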
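UUIDs
Parses UUIDs, such as 'db9674c4-72a9-4ab9-9ddd-1d641a37cde4'.
_hexStr = lambda n : Word(hexnums,exact=n)
uuid = Combine(_hexStr(8)+"-"+_hexStr(4)+"-"+_hexStr(4)+"-"+_hexStr(4)+"-"+_hexStr(12))
or (this version requires pyparsing 1.4.10)
_hexStr = lambda n : Word(hexnums,exact=n)
uuid = Combine(_hexStr(8) + ("-"+_hexStr(4))*3 + "-" + _hexStr(12))
MAC Address
Parses MAC addresses, in the form of 6 pairs of hex digits. (requires pyparsing 1.4.10)
_hex2 = Word(hexnums,exact=2)
macAddr = Combine( _hex2 + (("-" + _hex2)*5 | (":" + _hex2)*5) )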
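A minimal sketch that exercises UUID and MAC address expressions of the kind defined on this page. The names `hex_str`, `uuid_expr` and `mac_addr` and the sample MAC addresses are chosen here for illustration, and the `*` repetition assumes pyparsing 1.4.10 or later.

```python
from pyparsing import Combine, ParseException, Word, hexnums

def hex_str(n):
    # exactly n hexadecimal digits
    return Word(hexnums, exact=n)

uuid_expr = Combine(hex_str(8) + ("-" + hex_str(4)) * 3 + "-" + hex_str(12))

hex2 = hex_str(2)
mac_addr = Combine(hex2 + (("-" + hex2) * 5 | (":" + hex2) * 5))

# Combine joins the matched pieces back into a single token.
parsed_uuid = uuid_expr.parseString("db9674c4-72a9-4ab9-9ddd-1d641a37cde4")[0]
parsed_mac = mac_addr.parseString("00:1a:2b:3c:4d:5e")[0]

# Mixed separators are rejected, because each alternative repeats one separator.
try:
    mac_addr.parseString("00:1a-2b:3c:4d:5e")
    mixed_ok = True
except ParseException:
    mixed_ok = False
```

The expression is named `uuid_expr` rather than `uuid` here only to avoid shadowing the standard library module of the same name.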
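Timezone names
Worldwide and US-only lists of timezone names. (from http://www.worldtimezone.com/wtz-names/timezonenames.html)
tzname = oneOf("""ACDT ACST ADT AEDT AEST AFT AHDT AHST AKDT AKST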
AMST AMT ANAST ANAT ART AST AT AWDT AWST AZOST AZOT AZST AZT
BADT BAT BDST BDT BET BNT BORT BOT BRA BST BT BTT CAT CCT CDT
CEST CET CHADT CHAST CKT CLST CLT COT CST CUT CVT CWT CXT DAVT
DDUT DNT DST EASST EAST EAT ECT EDT EEST EET EGST EGT EMT EST
FDT FJST FJT FKST FKT FST FWT GALT GAMT GEST GET GFT GILT GMT
GST GT GYT GZ HAA HAC HADT HAE HAP HAR HAST HAT HAY HDT HFE
HFH HG HKT HL HNA HNC HNE HNP HNR HNT HNY HOE HST ICT IDLE
IDLW IDT IOT IRDT IRKST IRKT IRST IRT IST IT ITA JAVT JAYT JST
JT KDT KGST KGT KOST KRAST KRAT KST LHDT LHST LIGT LINT LKT
LST LT MAGST MAGT MAL MART MAT MAWT MDT MED MEDST MEST MESZ
MET MEWT MEX MEZ MHT MMT MPT MSD MSK MSKS MST MT MUT MVT MYT
NCT NDT NFT NOR NOVST NOVT NPT NRT NST NSUT NT NUT NZDT NZST
NZT OESZ OEZ OMSST OMST OZ PDT PET PETST PETT PGT PHOT PHT PKT
PMDT PMT PNT PONT PST PWT PYST PYT RET ROK SADT SAST SBT SCT
SET SGT SRT SST SWT SZ TAI TFT THA THAT TJT TKT TMT TOT TRUK
TST TUC TVT ULAST ULAT UT UTC UTZ UYT UZT VET VLAST VLAT VTZ
VUT WAKT WAST WAT WCT WEST WESZ WET WEZ WFT WGST WGT WIB WITA
WIT WST WTZ WUT WZ YAKST YAKT YAPT YDT YEKST YEKT YST""")
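# US timezones, as a list of names
us_tzname = oneOf("EST EDT CST CDT MST MDT PST PDT AKST AKDT HAST HADT HST")
# or, equivalently, as a regular expression
us_tzname = Regex("(([ECMP]|HA|AK)[SD]|HS)T")
# timezones of the lower 48 states, as a list of names
lower48us_tzname = oneOf("EST EDT CST CDT MST MDT PST PDT")
# or, equivalently, as a regular expression
lower48us_tzname = Regex("[ECMP][SD]T")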
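The list form and the regular-expression form of each US expression are meant to be interchangeable. A minimal sketch using only the standard `re` module to check that the regexes accept every listed name and reject a few names that appear only in the worldwide list. The variable names and the choice of rejected samples are made here for illustration.

```python
import re

us_names = "EST EDT CST CDT MST MDT PST PDT AKST AKDT HAST HADT HST".split()
lower48_names = "EST EDT CST CDT MST MDT PST PDT".split()

us_re = re.compile(r"(([ECMP]|HA|AK)[SD]|HS)T")
lower48_re = re.compile(r"[ECMP][SD]T")

# Every listed name must be matched in full by the corresponding regex.
us_ok = all(us_re.fullmatch(name) for name in us_names)
lower48_ok = all(lower48_re.fullmatch(name) for name in lower48_names)

# Names from the worldwide list that are not US timezones must not match.
wrongly_matched = [n for n in ["GMT", "UTC", "AEST", "HAT"] if us_re.fullmatch(n)]

# Alaska and Hawaii names belong to the US set but not to the lower 48.
outside_lower48 = [n for n in us_names if not lower48_re.fullmatch(n)]
```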
US State postal abbreviations
Postal abbreviations for US states and territories.
stateAbbreviation = oneOf("""AA AE AK AL AP AR AS AZ CA CO CT DC DE
FL FM GA GU HI IA ID IL IN KS KY LA MA MD ME MH MI MN MO MP MS
MT NC ND NE NH NJ NM NV NY OH OK OR PA PR PW RI SC SD TN TX UT
VA VI VT WA WI WV WY""")
# or, equivalently, as a regular expression
stateAbbreviation = Regex(r"A[AEKLPSRZ]|C[OAT]|D[EC]|F[LM]|G[UA]|HI|"
r"I[LNAD]|K[SY]|LA|M[EDAONIHTPS]|N[HJMCDEYV]|O[KHR]|P[ARW]|RI|"
r"S[CD]|T[XN]|UT|V[AIT]|W[AIVY]")
states = {
'AA' : 'Armed Forces Americas (except Canada)',
'AE' : 'Armed Forces Middle East',
'AK' : 'ALASKA',
'AL' : 'ALABAMA',
'AP' : 'Armed Forces Pacific',
'AR' : 'ARKANSAS',
'AS' : 'AMERICAN SAMOA',
'AZ' : 'ARIZONA',
'CA' : 'CALIFORNIA',
'CO' : 'COLORADO',
'CT' : 'CONNECTICUT',
'DC' : 'DISTRICT OF COLUMBIA',
'DE' : 'DELAWARE',
'FL' : 'FLORIDA',
'FM' : 'FEDERATED STATES OF MICRONESIA',
'GA' : 'GEORGIA',
'GU' : 'GUAM',
'HI' : 'HAWAII',
'IA' : 'IOWA',
'ID' : 'IDAHO',
'IL' : 'ILLINOIS',
'IN' : 'INDIANA',
'KS' : 'KANSAS',
'KY' : 'KENTUCKY',
'LA' : 'LOUISIANA',
'MA' : 'MASSACHUSETTS',
'MD' : 'MARYLAND',
'ME' : 'MAINE',
'MH' : 'MARSHALL ISLANDS',
'MI' : 'MICHIGAN',
'MN' : 'MINNESOTA',
'MO' : 'MISSOURI',
'MP' : 'NORTHERN MARIANA ISLANDS',
'MS' : 'MISSISSIPPI',
'MT' : 'MONTANA',
'NC' : 'NORTH CAROLINA',
'ND' : 'NORTH DAKOTA',
'NE' : 'NEBRASKA',
'NH' : 'NEW HAMPSHIRE',
'NJ' : 'NEW JERSEY',
'NM' : 'NEW MEXICO',
'NV' : 'NEVADA',
'NY' : 'NEW YORK',
'OH' : 'OHIO',
'OK' : 'OKLAHOMA',
'OR' : 'OREGON',
'PA' : 'PENNSYLVANIA',
'PR' : 'PUERTO RICO',
'PW' : 'PALAU',
'RI' : 'RHODE ISLAND',
'SC' : 'SOUTH CAROLINA',
'SD' : 'SOUTH DAKOTA',
'TN' : 'TENNESSEE',
'TX' : 'TEXAS',
'UT' : 'UTAH',
'VA' : 'VIRGINIA',
'VI' : 'VIRGIN ISLANDS',
'VT' : 'VERMONT',
'WA' : 'WASHINGTON',
'WI' : 'WISCONSIN',
'WV' : 'WEST VIRGINIA',
'WY' : 'WYOMING',
}
# add parse action to convert abbreviation to full state name
stateAbbreviation.setParseAction(lambda t:states[t[0]])
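Because of the parse action, a successful match returns the full name rather than the two-letter code. A minimal sketch with a three-entry table, abbreviated here for illustration; the names `sample_states` and `sample_abbr` are hypothetical, and the full `states` dictionary is used in the same way.

```python
from pyparsing import oneOf

# Abbreviated lookup table, for illustration only.
sample_states = {"CA": "CALIFORNIA", "NY": "NEW YORK", "TX": "TEXAS"}

sample_abbr = oneOf(" ".join(sample_states))
# Replace the matched abbreviation with the full name from the table.
sample_abbr.setParseAction(lambda t: sample_states[t[0]])

full_name = sample_abbr.parseString("TX")[0]

# searchString applies the same conversion to every match in a longer string.
found = [tokens[0] for tokens in sample_abbr.searchString("NY, CA and TX")]
```

Note that `searchString` matches the abbreviation wherever it occurs, including inside longer upper-case words, so real input may need a stricter expression.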
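E-mail addresses
E-mail addresses are notorious for having many tortuous and arcane forms: e-mail has been around since the early days of the internet and has evolved through many "standards". The expression below comes from http://www.regular-expressions.info/email.html and covers most common e-mail addresses in use today.
emailExpr = Regex(r"(?P<user>[A-Za-z0-9._%+-]+)@(?P<hostname>[A-Za-z0-9.-]+)\.(?P<domain>[A-Za-z]{2,4})")
The named re fields get translated into results names by pyparsing. For example:
print(emailExpr.parseString("paul@users.sourceforge.net").dump())
prints out:
['paul@users.sourceforge.net']
- domain: net
- hostname: users.sourceforge
- user: paul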
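The named groups are ordinary `re` features, so the pattern can also be tried with the standard `re` module before wrapping it in a pyparsing `Regex`. A minimal sketch; `EMAIL_RE` is a name introduced here, and the second address is a made-up example showing that the `{2,4}` limit on the last label makes the pattern stop short on longer top-level domains.

```python
import re

EMAIL_RE = re.compile(
    r"(?P<user>[A-Za-z0-9._%+-]+)@(?P<hostname>[A-Za-z0-9.-]+)\.(?P<domain>[A-Za-z]{2,4})"
)

# The named groups correspond to the results names shown in the dump.
fields = EMAIL_RE.match("paul@users.sourceforge.net").groupdict()

# A top-level domain longer than four letters is matched only in part ...
partial = EMAIL_RE.match("curator@example.museum").group()
# ... so requiring the whole string to match fails.
whole = EMAIL_RE.fullmatch("curator@example.museum")
```

When using the pyparsing expression on input where this matters, requiring the whole string to match is one way to catch such partial matches.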
ANSI Terminal Escape Sequences
Back in the day, computers were huge distant servers walled off in The Computer Room. On your desk, you probably had a "dumb" terminal, like a VT100. These terminals supported a special language of escape sequences to move the cursor about, clear parts of the screen, change the screen scroll region, display in color, flashing, bold, or reverse, and so on. Some sequences would even flash the lights on the keyboard. Sometimes you would retrieve a log of one of these terminal sessions, and it would be littered with the control sequences. This parser is generic enough to match all or most of them.
ESC = Literal('\x1b')
integer = Word(nums)
escapeSeq = Combine(ESC + '[' + Optional(delimitedList(integer,';')) + oneOf(list(alphas)))
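One use for such an expression is cleaning up a captured session log. A minimal sketch, assuming the `escapeSeq` definition given on this page and a made-up log line, that uses `transformString` to drop every matched sequence:

```python
from pyparsing import (Combine, Literal, Optional, Word, alphas,
                       delimitedList, nums, oneOf)

ESC = Literal("\x1b")
integer = Word(nums)
escapeSeq = Combine(ESC + "[" + Optional(delimitedList(integer, ";")) + oneOf(list(alphas)))

# Hypothetical log line: bold red "ERROR", then a reset sequence.
raw = "\x1b[1;31mERROR\x1b[0m disk full"

# suppress() discards the matched tokens, so transformString removes the sequences
# and leaves the surrounding text untouched.
clean = escapeSeq.suppress().transformString(raw)
print(clean)
```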